Tuesday, June 27, 2006

Optimal histogram bin width

Kevin Knuth wrote a paper on finding the optimal number of bins for representing data in a histogram ("Optimal Data-Based Binning for Histograms"). He starts from a piecewise-constant density model and derives the (Bayesian) posterior probability of the number of bins given the data (equation 36, which is actually the log of the posterior). Maximizing this posterior gives the number of bins that best models the data.
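As a rough illustration, here is a minimal Python sketch of the maximization. It assumes the commonly quoted form of Knuth's relative log posterior for M equal-width bins and N data points with bin counts n_k: N log M + log Γ(M/2) - M log Γ(1/2) - log Γ(N + M/2) + Σ log Γ(n_k + 1/2), up to an additive constant. That form should be checked against equation 36 in the paper before relying on it. The function names, the `max_bins` cutoff, and the synthetic Gaussian sample are my own choices for illustration.

```python
import math
import random

def bin_counts(data, m):
    """Count points in m equal-width bins spanning [min(data), max(data)]."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must span a nonzero range")
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        # Clamp so that the maximum value falls into the last bin.
        k = min(int((x - lo) / width), m - 1)
        counts[k] += 1
    return counts

def log_posterior(data, m):
    """Relative log posterior for m bins (assumed form, additive constant dropped)."""
    n = len(data)
    counts = bin_counts(data, m)
    return (n * math.log(m)
            + math.lgamma(m / 2)
            - m * math.lgamma(0.5)
            - math.lgamma(n + m / 2)
            + sum(math.lgamma(c + 0.5) for c in counts))

def optimal_bins(data, max_bins=50):
    """Brute-force search for the bin count that maximizes the log posterior."""
    return max(range(1, max_bins + 1), key=lambda m: log_posterior(data, m))

# Synthetic test data: 1000 points from a unit Gaussian (fixed seed).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
best = optimal_bins(sample)
print("optimal number of bins:", best)
```

The brute-force scan over 1 to `max_bins` is the simplest approach. The upper limit is arbitrary and should be raised if the maximum lands on it.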


The article also investigates how many data points are needed for a reliable estimate of the density. The recommendation is 100-150 points if the distribution is Gaussian.


It would be interesting to apply this method to radial distribution functions. However, the assumption of a constant volume for each bin is not met in this case: with uniformly spaced radial bins, the volume of each spherical shell grows with radius. There are several ways this could be adjusted (scale each bin count by the bin volume, or use non-uniform bin spacing to maintain constant volume), but I'm not sure they are valid.
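To make the two adjustments concrete, here is a small Python sketch of the geometry only. It assumes a 3D radial distribution out to a cutoff `r_max` (a name I made up), with shell volume (4/3)π(r_outer³ - r_inner³). It says nothing about whether the posterior stays valid under either adjustment.

```python
import math

def uniform_edges(r_max, m):
    """Bin edges equally spaced in r: shell volume grows with radius."""
    return [r_max * k / m for k in range(m + 1)]

def equal_volume_edges(r_max, m):
    """Bin edges spaced so every spherical shell has the same volume.

    Volume inside radius r scales as r**3, so edges sit at r_max * (k/m)**(1/3).
    """
    return [r_max * (k / m) ** (1.0 / 3.0) for k in range(m + 1)]

def shell_volumes(edges):
    """Volume of each spherical shell between consecutive edges."""
    return [4.0 / 3.0 * math.pi * (b ** 3 - a ** 3)
            for a, b in zip(edges, edges[1:])]

def scale_by_volume(counts, edges):
    """Option 1: divide each bin count by its shell volume."""
    return [c / v for c, v in zip(counts, shell_volumes(edges))]

# Option 2: use equal-volume edges so raw counts are directly comparable.
edges = equal_volume_edges(10.0, 5)
volumes = shell_volumes(edges)
print(edges)
print(volumes)
```

One thing to note about the equal-volume option: the bins get narrower in r at large radius and are very wide near the origin, which may or may not suit the features one wants to resolve.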


Alternatively, the paper's discussion references other algorithms for dealing with variable bin-width models (which may be better for resolving multiple peaks anyway).
